Abstract


Metadata disorder and unnecessary costs are increasing as the population of scientific data
schemes and standards expands. Metadata challenges are reviewed, and SeaIce¹, a
community-driven metadata vocabulary application, is introduced as a potential solution. SeaIce
functions and development challenges are presented. CAMP-4-DATA participants are called
upon to experiment with the SeaIce application and actively participate in a discussion targeting
the noted metadata challenges.


The Problem: Duplicative Metadata Efforts 


Metadata is essential for managing research data. Scientists, data managers, and the full range
of data information systems (e.g., repositories, grid computing, and cloud resources) rely on
metadata to operate effectively. Today, driven by the digital data deluge, we find a plethora of
discipline-oriented metadata standards supporting the same or similar functions (Willis et al.,
2012). For example, nearly all descriptive metadata standards support discovery via topical
subject terms or keywords; some include more granular properties for spatial and temporal data.
Efforts to establish property semantics and define content are duplicated time and again,
resulting in schemes that differ marginally, if at all. The resulting population of metadata
standards raises concerns about disorder and cost, particularly given the overlap in supported
functionalities.


Clearly, overlap among metadata schemes aids interoperability, specifically data exchange and
cross-system searching. Benefits aside, duplicative efforts incur unnecessary costs, realized via
the following:


• Metadata requires human and financial resources (Russom, 2010; Greenberg et al., 2013).

• Intellectual demand and system development incur costs when aiming for metadata
  interoperability.

• Extending an existing scheme with new properties increases metadata costs.


1 The SeaIce project is also known as YAMZ (Yet Another Metadata Zoo). This change was instituted
following the presentation of this work at the CAMP-4-DATA.
2 Dublin Core Application Profiles: http://dublincore.org/documents/profile-guidelines/.


Dublin Core Metadata Application Profiles (DCAPs)² and linked open data (LOD) can, on some
level, help circumvent duplication and cost by leveraging existing metadata work. An approach
built around virtual and social communities of practice may provide a complementary or
alternative way to address these challenges.


The DataONE Preservation and Metadata Working Group (PAMWG)³ advocates for a social
approach to metadata vocabulary design. PAMWG has prototyped a metadictionary called
SeaIce⁴ that uses crowdsourcing for establishing metadata terms and engaging metadata
stakeholders. The remainder of this paper introduces SeaIce, documents current features and
goals, and discusses next steps. The last section of the paper calls upon CAMP-4-DATA
participants to experiment with SeaIce and engage in a discussion to address metadata challenges.


Introducing SeaIce: Context for a Crowdsourced Metadictionary


SeaIce Context
The SeaIce metadictionary is being developed to host community-driven metadata terms and
definitions. Chief goals include reducing duplicative metadata activity and unifying metadata
practices across disciplines. Functional requirements are presented in Table 1.


Table 1. Functional Requirements (Greenberg et al., 2012)

• Low barrier for contributions.
• Transparency in the review process.
• Collective team review, with rotating responsibilities among community members
  (scientists, developers, organizations, curators, etc.).
• Consideration of elders (experts) to guide the review process and maintain thoughtful,
  balanced discussion.
• Voting capacity of all users on the candidacy of terms submitted and their use.
• Collective ownership of any user or organization.
• Stakeholder engagement in the design and review process.











DataONE⁵ serves as the target implementation community, although SeaIce has implications for
any domain seeking to reduce duplicative efforts. DataONE is an ideal environment for launching
SeaIce given the range of disciplines represented (e.g., ecology, biology, geology, astronomy,
and their many sub-disciplines) and the diversity of metadata stakeholders (data creators,
curators, system developers, and administrators).





3 DataONE Preservation and Metadata Working Group:
http://www.dataone.org/working_groups/data-preservation-metadata-and-interoperability-working-group.
4 SeaIce Metadictionary: http://seaice.herokuapp.com/.
5 DataONE: http://www.dataone.org/.


Proc. Int’l Conf. on Dublin Core and Metadata Applications 2013 


DataONE is a community and a distributed framework providing steps toward a sustainable
cyberinfrastructure. The SeaIce metadata dictionary supports this overriding goal by exploring an
innovative means for a persistent and robust metadata infrastructure (Kunze et al., 2013). By
utilizing crowdsourcing techniques, the SeaIce metadictionary can help eliminate duplicative
efforts, reduce associated costs, and provide an innovative framework for metadata
interoperability across disciplines for stakeholder communities. The aim is a ‘high-quality social
ecosystem’ in which the community of metadata stakeholders engages in dialog, confirms terms
and definitions, and unifies metadata practices.


SeaIce: Prototype and Framework


SeaIce is modeled on StackOverflow⁶ and other social software services. Figure 1 presents
the SeaIce homepage.





[Figure 1: screenshot of the SeaIce homepage, headed "Metadictionary" with the tagline
"A crowd sourced metadata dictionary. Search for terms, upvote useful ones."]

FIG. 1. SeaIce Homepage


When logged in, users may vote terms ‘up’ or ‘down’ based on the definition and other aspects
of importance; engage in online discussions about a term, its definition, or its use; and propose
new terms for discussion and voting. Figure 2 shows voting activity for a series of terms.


[Figure 2: screenshot of the SeaIce "Browse dictionary" view. The listing can be sorted by
high score, recent, volatile, stable, or alphabetical, and has the columns Term, Consensus,
Contributed by, and Last modified. Terms shown include "data", "publisher", "creator",
"hydraulic gradient", and "talus slope"; "token" and "talus" are marked as deprecated.]

FIG. 2. Browse View/Voting scores for terms


As on StackOverflow, users may modify or delete their terms and definitions at any time.
When this occurs, those who have voted on the term are notified. In addition, SeaIce provides
listings of newly submitted terms, highly rated terms, and highly stable terms in order to guide
users on which terms are ready for discussion and voting. Work is under way for SeaIce to
provide a search mechanism that gives priority to highly rated and highly stable terms in its
results.


6 StackOverflow: http://stackoverflow.com/.


SeaIce Features and Ongoing Development


The SeaIce metadictionary presents a number of challenges not found in other crowdsourcing
environments. Many social network systems rely on voting or ranking of answers. SeaIce is
unique in accommodating a wide array of stakeholders (data creators, curators, developers,
administrators, and anyone else with a vested interest in metadata), so the community of
practice is quite diverse. Additionally, social technology is being used in SeaIce to identify a
set of stable canonical terms, and these terms will form a common metadata practice specific to
scientific data. This process must be fully automated and must reflect the consensus of the full
stakeholder community. A central problem is that it is unlikely that every user will vote on every
term. The PAMWG is therefore exploring a heuristic for consensus based on user reputation. This
heuristic involves stability, term class, and voting impact. Ideas surrounding the heuristic
functionality and SeaIce in general are captured in an open blog.⁷ The percentages and time
intervals presented directly below are preliminary.


Stability 


A term is considered stable if it meets two criteria: (1) neither the term nor its definition has
been edited by the owner for some predefined period of time, and (2) the rate of change of the
score has dropped below a certain threshold close to zero.
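
The two criteria can be sketched as a small check. This is only an illustration under stated assumptions: the names `is_stable`, `quiet_period`, and `rate_threshold` are hypothetical, the rate of change is estimated from the two most recent score samples, the two-day default echoes the preliminary interval in Table 2, and the threshold value is arbitrary.

```python
from datetime import datetime, timedelta

def is_stable(last_edit, score_samples, now,
              quiet_period=timedelta(days=2), rate_threshold=0.01):
    """Return True if a term meets both stability criteria.

    last_edit: datetime of the owner's last edit to the term or definition.
    score_samples: list of (datetime, score) pairs, oldest first.
    quiet_period and rate_threshold are illustrative assumptions.
    """
    # Criterion 1: no owner edits during the quiet period.
    if now - last_edit < quiet_period:
        return False
    # Criterion 2: the score's rate of change (per hour) is near zero.
    if len(score_samples) < 2:
        return False  # not enough history to estimate a rate
    (t0, s0), (t1, s1) = score_samples[-2], score_samples[-1]
    hours = (t1 - t0).total_seconds() / 3600
    if hours <= 0:
        return False
    return abs(s1 - s0) / hours < rate_threshold
```

A real implementation would likely smooth the score over a longer window; two samples are used here only to keep the sketch short.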


Classes 
SeaIce has designated three term classes:
• Canonical - the set of stable terms with consensus over 75%.

• Deprecated - the set of stable terms with consensus under 25%. In the case that there is a
  suitable replacement somewhere in the dictionary, we expect it will be standard practice
  to reference it in the deprecated term’s definition.

• Vernacular - the set of terms that cannot be classified as canonical or deprecated
  (unstable terms).
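
The three classes can be expressed as a small decision function. The sketch below is hypothetical (the function name and parameters are not taken from the SeaIce code base) and uses the preliminary 75% and 25% thresholds. It assumes that a stable term whose consensus falls between the two thresholds is vernacular, which is how Table 2 treats a canonical term whose consensus drops below 75%.

```python
def classify(stable, consensus, high=0.75, low=0.25):
    """Map a term's stability flag and consensus (0.0 to 1.0) to a class name.

    The default thresholds are the preliminary values quoted in the text.
    """
    if stable and consensus > high:
        return "canonical"
    if stable and consensus < low:
        return "deprecated"
    # Unstable terms, and stable terms between the thresholds.
    return "vernacular"


for stable, consensus in [(True, 0.80), (True, 0.10), (False, 0.90), (True, 0.50)]:
    print(stable, consensus, classify(stable, consensus))
```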


Voting and scoring 


A SeaIce user may cast a single up or down vote on a particular term and may change it at any
time. The weight of a vote is the ratio of the voter's reputation to the sum of the reputations
of all users voting on the term. When a term has a small voting body, reputation is very
important; this allows good terms to be promoted quickly and bad terms to be deprecated quickly.
As the voting body grows, any single voter's reputation loses significance and the weights of
the votes become more equitable. Reputation thus serves as a heuristic for consensus, and the
score becomes more equitable as the number of people with an opinion grows. Table 2 shows
potential ways in which term classes may change.
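
To make the weighting concrete, the following sketch computes a consensus score as the reputation-weighted share of up votes. The text gives only the weight of an individual vote, so combining the weights this way is an assumption, and the function name, data layout, and reputation values are hypothetical.

```python
def consensus(votes):
    """Reputation-weighted share of up votes for one term.

    votes: dict mapping a user name to (reputation, vote), where vote is
    +1 (up) or -1 (down). Returns a value in [0.0, 1.0], or None when
    there is no reputation to weigh (for example, nobody has voted).
    """
    total_reputation = sum(rep for rep, _ in votes.values())
    if total_reputation <= 0:
        return None
    # Each vote's weight is the voter's reputation divided by the summed
    # reputation of everyone who voted on this term.
    return sum(rep / total_reputation
               for rep, vote in votes.values() if vote > 0)


# One high-reputation voter dominates a small voting body.
small_body = {"expert": (90, +1), "newcomer_a": (5, -1), "newcomer_b": (5, -1)}
print(consensus(small_body))
```

With only three voters, the single high-reputation up vote carries most of the weight. As more users vote, each individual share of the total reputation shrinks, which matches the behavior described above.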





7 Christopher Patton’s Blog is part of the Bi-level Metadata Registry Development project, DataONE 2013 
Summer Internship program; see: https://notebooks.dataone.org/metadata-registry. 




TABLE 2. Term Classes and Voting Impact

Vernacular → canonical -- term is stable after two days and consensus is above 75%.

Vernacular → deprecated -- term is stable after two days and consensus is below 25%.

Canonical → vernacular -- term has been updated, restabilized, and consensus has dropped
below 75%.

Deprecated → vernacular -- term has been updated, restabilized, and consensus has risen
above 25%.











Conclusion 


Duplicative metadata efforts are not cost effective and require attention. SeaIce, a
crowdsourced metadictionary, may help address this challenge and the disorder stemming from
the growing number of metadata schemes. SeaIce is in a development stage, and PAMWG members
are experimenting with crowdsourcing metadata terms and definitions. Next steps include
broadening participation and engaging others to experiment with SeaIce. The CAMP-4-DATA
aims to “explore infrastructure design, applications, and policies that can advance the support of
open, collective and sustainable access to metadata standards used for managing scientific data.”
The SeaIce application fits this call, and DataONE PAMWG members welcome the opportunity to
present SeaIce at the CAMP-4-DATA. We outline three key objectives for participants:


• Test the SeaIce application by entering one or more terms.
• Test the voting mechanism for SeaIce by voting on one or more terms.
• Engage in an open discussion with DataONE PAMWG members at the CAMP-4-DATA.